{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Explore Data In Redshift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install SQL Alchemy\n",
    "!pip install -q SQLAlchemy==1.3.13"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set Redshift Connection Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "redshift_schema = \"redshift\"\n",
    "redshift_cluster_identifier = \"dsoaws\"\n",
    "redshift_host = \"dsoaws\"\n",
    "redshift_database = \"dsoaws\"\n",
    "redshift_port = \"5439\"\n",
    "redshift_table_2015 = \"amazon_reviews_tsv_2015\"\n",
    "redshift_table_2014 = \"amazon_reviews_tsv_2014\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the Redshift Secrets from Secrets Manager"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import boto3\n",
    "\n",
    "secretsmanager = boto3.client(\"secretsmanager\")\n",
    "\n",
    "secret = secretsmanager.get_secret_value(SecretId=\"dsoaws_redshift_login\")\n",
    "cred = json.loads(secret[\"SecretString\"])\n",
    "\n",
    "redshift_username = cred[0][\"username\"]\n",
    "redshift_pw = cred[1][\"password\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "redshift = boto3.client(\"redshift\")\n",
    "\n",
    "response = redshift.describe_clusters(ClusterIdentifier=redshift_cluster_identifier)\n",
    "\n",
    "redshift_endpoint_address = response[\"Clusters\"][0][\"Endpoint\"][\"Address\"]\n",
    "\n",
    "print(redshift_endpoint_address)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the Redshift Connection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import awswrangler as wr\n",
    "\n",
    "con_redshift = wr.data_api.redshift.connect(\n",
    "    cluster_id=redshift_cluster_identifier,\n",
    "    database=redshift_database,\n",
    "    db_user=redshift_username,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using APPROXIMATE COUNT for 'blazing fast' results\n",
    "\n",
    "The COUNT function counts the rows defined by the expression.\n",
    "\n",
    "The COUNT function has three variations. COUNT ( * ) counts all the rows in the target table whether they include nulls or not. COUNT ( expression ) computes the number of rows with non-NULL values in a specific column or expression. COUNT ( DISTINCT expression ) computes the number of distinct non-NULL values in a column or expression.\n",
    "\n",
    "When used with APPROXIMATE, a COUNT ( DISTINCT expression ) function uses a HyperLogLog algorithm to approximate the number of distinct non-NULL values in a column or expression. Queries that use the APPROXIMATE keyword execute much faster, with a low relative error of around 2%. Approximation is warranted for queries that return a large number of distinct values, in the millions or more per query, or per group, if there is a group by clause. For smaller sets of distinct values, in the thousands, approximation might be slower than a precise count. APPROXIMATE can only be used with COUNT ( DISTINCT )."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compare the query execution times of the two queries below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "df = wr.data_api.redshift.read_sql_query(\n",
    "    sql=\"\"\"SELECT approximate count(distinct customer_id)\n",
    "                        FROM {}.{}\n",
    "                        GROUP BY product_category\"\"\".format(\n",
    "        redshift_schema, redshift_table_2015\n",
    "    ),\n",
    "    con=con_redshift,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "df = wr.data_api.redshift.read_sql_query(\n",
    "    sql=\"\"\"SELECT count(distinct customer_id)\n",
    "                                FROM {}.{}\n",
    "                                GROUP BY product_category\"\"\".format(\n",
    "            redshift_schema, redshift_table_2015\n",
    "    ),\n",
    "    con=con_redshift,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's do some Visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format='retina'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "statement = \"\"\"\n",
    "SELECT product_category,\n",
    "COUNT(star_rating) AS count_star_rating\n",
    "FROM {}.{}\n",
    "GROUP BY product_category\n",
    "ORDER BY count_star_rating DESC\n",
    "\"\"\".format(\n",
    "    redshift_schema, redshift_table_2015\n",
    ")\n",
    "\n",
    "print(statement)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = wr.data_api.redshift.read_sql_query(\n",
    "    sql=statement,\n",
    "    con=con_redshift,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store number of categories\n",
    "num_categories = df.shape[0]\n",
    "print(num_categories)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Store max ratings\n",
    "max_ratings = df[\"count_star_rating\"].max()\n",
    "print(max_ratings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set size and style to use\n",
    "if num_categories > 10:\n",
    "    plt.figure(figsize=(10, 10))\n",
    "else:\n",
    "    plt.figure(figsize=(10, 5))\n",
    "\n",
    "plt.style.use(\"seaborn-whitegrid\")\n",
    "\n",
    "# Create Seaborn barplot\n",
    "barplot = sns.barplot(y=\"product_category\", x=\"count_star_rating\", data=df, saturation=1)\n",
    "\n",
    "# Set title\n",
    "plt.title(\"Number of Ratings per Product Category (Redshift)\")\n",
    "\n",
    "# Set x-axis ticks to match scale from 10mio reviews to 20mio reviews\n",
    "if max_ratings <= 8000:\n",
    "    plt.xticks(\n",
    "        [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000],\n",
    "        [\"10K\", \"20K\", \"30K\", \"40K\", \"50K\", \"60K\", \"70K\", \"80K\"],\n",
    "    )\n",
    "    plt.xlim(0, 80000)\n",
    "elif max_ratings <= 200000:\n",
    "    plt.xticks([50000, 100000, 150000, 200000], [\"50K\", \"100K\", \"1500K\", \"200K\"])\n",
    "    plt.xlim(0, 200000)\n",
    "elif max_ratings > 200000:\n",
    "    plt.xticks([100000, 1000000, 5000000, 10000000, 15000000, 20000000], [\"100K\", \"1m\", \"5m\", \"10m\", \"15m\", \"20m\"])\n",
    "    plt.xlim(0, 20000000)\n",
    "\n",
    "plt.xlabel(\"Number of Ratings\")\n",
    "plt.ylabel(\"Product Category\")\n",
    "\n",
    "plt.tight_layout()\n",
    "\n",
    "# Export plot if needed\n",
    "# plt.savefig('ratings_per_category.png', dpi=300)\n",
    "\n",
    "# Show the barplot\n",
    "plt.show(barplot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query Athena using Redshift Spectrum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "athena_schema = \"athena\"\n",
    "athena_table_name = \"amazon_reviews_tsv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "statement = \"\"\"\n",
    "SELECT product_category, COUNT(star_rating) AS count_star_rating\n",
    "FROM {}.{}\n",
    "GROUP BY product_category\n",
    "ORDER BY count_star_rating DESC\n",
    "\"\"\".format(\n",
    "    athena_schema, athena_table_name\n",
    ")\n",
    "\n",
    "print(statement)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = wr.data_api.redshift.read_sql_query(\n",
    "    sql=statement,\n",
    "    con=con_redshift,\n",
    ")\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set size and style to use\n",
    "if num_categories > 10:\n",
    "    plt.figure(figsize=(10, 10))\n",
    "else:\n",
    "    plt.figure(figsize=(10, 5))\n",
    "\n",
    "plt.style.use(\"seaborn-whitegrid\")\n",
    "\n",
    "# Create Seaborn barplot\n",
    "barplot = sns.barplot(y=\"product_category\", x=\"count_star_rating\", data=df, saturation=1)\n",
    "\n",
    "# Set title\n",
    "plt.title(\"Number of Ratings per Product Category (Athena via Redshift Spectrum)\")\n",
    "\n",
    "# Set x-axis ticks to match scale from 10mio reviews to 20mio reviews\n",
    "# Set x-axis ticks to match scale from 10mio reviews to 20mio reviews\n",
    "if max_ratings <= 8000:\n",
    "    plt.xticks(\n",
    "        [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000],\n",
    "        [\"10K\", \"20K\", \"30K\", \"40K\", \"50K\", \"60K\", \"70K\", \"80K\"],\n",
    "    )\n",
    "    plt.xlim(0, 80000)\n",
    "elif max_ratings <= 200000:\n",
    "    plt.xticks([50000, 100000, 150000, 200000], [\"50K\", \"100K\", \"1500K\", \"200K\"])\n",
    "    plt.xlim(0, 200000)\n",
    "elif max_ratings > 200000:\n",
    "    plt.xticks([100000, 1000000, 5000000, 10000000, 15000000, 20000000], [\"100K\", \"1m\", \"5m\", \"10m\", \"15m\", \"20m\"])\n",
    "    plt.xlim(0, 20000000)\n",
    "\n",
    "plt.xlabel(\"Number of Ratings\")\n",
    "plt.ylabel(\"Product Category\")\n",
    "\n",
    "plt.tight_layout()\n",
    "\n",
    "# Export plot if needed\n",
    "# plt.savefig('ratings_per_category.png', dpi=300)\n",
    "\n",
    "# Show the barplot\n",
    "plt.show(barplot)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Release Resources"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "\n",
    "try {\n",
    "    Jupyter.notebook.save_checkpoint();\n",
    "    Jupyter.notebook.session.delete();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%html\n",
    "\n",
    "<p><b>Shutting down your kernel for this notebook to release resources.</b></p>\n",
    "<button class=\"sm-command-button\" data-commandlinker-command=\"kernelmenu:shutdown\" style=\"display:none;\">Shutdown Kernel</button>\n",
    "\n",
    "<script>\n",
    "try {\n",
    "    els = document.getElementsByClassName(\"sm-command-button\");\n",
    "    els[0].click();\n",
    "}\n",
    "catch(err) {\n",
    "    // NoOp\n",
    "}\n",
    "</script>"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}